NSF PAR Search | NSF Public Access Repository

Documenting the Unwritten Curriculum of Student Research

Wilson, Shomir (August 2024, Association for Computational Linguistics)

Graduate and undergraduate student researchers in natural language processing (NLP) often need mentoring to learn the norms of research. While methodological and technical knowledge are essential, there is also a “hidden curriculum” of experiential knowledge about topics like work strategies, common obstacles, collaboration, conferences, and scholarly writing. As a professor, I have written a set of guides that cover typically unwritten customs and procedures for academic research. I share them with advisees to help them understand research norms and to help us focus on their specific questions and interests. This paper describes these guides, which are freely accessible on the web (https://shomir.net/advice), and I provide recommendations to faculty who are interested in creating similar materials for their advisees.
Full Text Available
Incorporating Taxonomic Reasoning and Regulatory Knowledge into Automated Privacy Question Answering

Ravichander, Abhilasha; Yang, Ian; Chen, Rex; Wilson, Shomir; Norton, Thomas; Sadeh, Norman (November 2024, International Web Information Systems Engineering)

Full Text Available
Incorporating Taxonomic Reasoning and Regulatory Knowledge into Automated Privacy Question Answering

https://doi.org/10.1007/978-981-96-0579-8_31

Ravichander, Abhilasha; Yang, Ian; Chen, Rex; Wilson, Shomir; Norton, Thomas; Sadeh, Norman (November 2024, Springer Nature Singapore)

Privacy policies are often lengthy and complex legal documents, and are difficult for many people to read and comprehend. Recent research efforts have explored automated assistants that process the language in policies and answer people’s privacy questions. This study documents the importance of two different types of reasoning necessary to generate accurate answers to people’s privacy questions. The first is the need to support taxonomic reasoning about related terms commonly found in privacy policies. The second is the need to reason about regulatory disclosure requirements, given the prevalence of silence in privacy policy texts. Specifically, we report on a study involving the collection of 749 sets of expert annotations to answer privacy questions in the context of 210 different policy/question pairs. The study highlights the importance of taxonomic reasoning and of reasoning about regulatory disclosure requirements when it comes to accurately answering everyday privacy questions. Next we explore to what extent current generative AI tools are able to reliably handle this type of reasoning. Our results suggest that in their current form and in the absence of additional help, current models cannot reliably support the type of reasoning about regulatory disclosure requirements necessary to accurately answer privacy questions. We proceed to introduce and evaluate different approaches to improving their performance. Through this work, we aim to provide a richer understanding of the capabilities automated systems need to have to provide accurate answers to everyday privacy questions and, in the process, outline paths for adapting AI models for this purpose.
Full Text Available
Understanding How to Inform Blind and Low-Vision Users about Data Privacy through Privacy Question Answering Assistants

Feng, Yuanyuan; Ravichander, Abhilasha; Yao, Yashing; Zhang, Shikun; Chen, Rex; Wilson, Shomir; Sadeh, Norman (August 2024, 33rd USENIX Security Symposium (USENIX Security 24) - USENIX Association)

Understanding and managing data privacy in the digital world can be challenging for sighted users, let alone blind and low-vision (BLV) users. There is limited research on how BLV users, who have special accessibility needs, navigate data privacy, and how potential privacy tools could assist them. We conducted an in-depth qualitative study with 21 US BLV participants to understand their data privacy risk perception and mitigation, as well as their information behaviors related to data privacy. We also explored BLV users’ attitudes towards potential privacy question answering (Q&A) assistants that enable them to better navigate data privacy information. We found that BLV users face heightened security and privacy risks, but their risk mitigation is often insufficient. They do not necessarily seek data privacy information but clearly recognize the benefits of a potential privacy Q&A assistant. They also expect privacy Q&A assistants to possess cross-platform compatibility, support multi-modality, and demonstrate robust functionality. Our study sheds light on BLV users’ expectations when it comes to usability, accessibility, trust and equity issues regarding digital data privacy.
Full Text Available
Documenting the Unwritten Curriculum of Student Research

Wilson, Shomir (January 2024, Association for Computational Linguistics)

Graduate and undergraduate student researchers in natural language processing (NLP) often need mentoring to learn the norms of research. While methodological and technical knowledge are essential, there is also a “hidden curriculum” of experiential knowledge about topics like work strategies, common obstacles, collaboration, conferences, and scholarly writing. As a professor, I have written a set of guides that cover typically unwritten customs and procedures for academic research. I share them with advisees to help them understand research norms and to help us focus on their specific questions and interests. This paper describes these guides, which are freely accessible on the web (https://shomir.net/advice), and I provide recommendations to faculty who are interested in creating similar materials for their advisees.
Full Text Available
Creation and Analysis of an International Corpus of Privacy Laws

Gupta, Sonu; Gopi, Geetika; Balaji, Harish; Poplavska, Ellen; O’Toole, Nora; Arora, Siddhant; Norton, Thomas; Sadeh, Norman; Wilson, Shomir (June 2024, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024))

The landscape of privacy laws and regulations around the world is complex and ever-changing. National and super-national laws, agreements, decrees, and other government-issued rules form a patchwork that companies must follow to operate internationally. To examine the status and evolution of this patchwork, we introduce the Privacy Law Corpus, of 1,043 privacy laws, regulations, and guidelines, covering 183 jurisdictions. This corpus enables a large-scale quantitative and qualitative examination of legal focus on privacy. We examine the temporal distribution of when privacy laws were created and illustrate the dramatic increase in privacy legislation over the past 50 years, although a finer-grained examination reveals that the rate of increase varies depending on the personal data types that privacy laws address. Our exploration also demonstrates that most privacy laws respectively address relatively few personal data types. Additionally, topic modeling results show the prevalence of common themes in privacy laws, such as finance, healthcare, and telecommunications. Finally, we release the corpus to the research community to promote further study.
Full Text Available
Online Self-Disclosure, Social Support, and User Engagement During the COVID-19 Pandemic

https://doi.org/10.1145/3617654

Lee, Jooyoung; Rajtmajer, Sarah; Srivatsavaya, Eesha; Wilson, Shomir (December 2023, ACM Transactions on Social Computing)

We investigate relationships between online self-disclosure and received social support and user engagement during the COVID-19 crisis. We crawl a total of 2,399 posts and 29,851 associated comments from the r/COVID19_support subreddit and manually extract fine-grained personal information categories and types of social support sought from each post. We develop a BERT-based ensemble classifier to automatically identify types of support offered in users’ comments. We then analyze the effect of personal information sharing and posts’ topical, lexical, and sentiment markers on the acquisition of support and five interaction measures (submission scores, the number of comments, the number of unique commenters, the length and sentiments of comments). Our findings show that: (1) users were more likely to share their age, education, and location information when seeking both informational and emotional support as opposed to pursuing either one; (2) while personal information sharing was positively correlated with receiving informational support when requested, it did not correlate with emotional support; (3) as the degree of self-disclosure increased, information support seekers obtained higher submission scores and longer comments, whereas emotional support seekers’ self-disclosure resulted in lower submission scores, fewer comments, and fewer unique commenters; and (4) post characteristics affecting audience response differed significantly based on types of support sought by post authors. These results provide empirical evidence for the varying effects of self-disclosure on acquiring desired support and user involvement online during the COVID-19 pandemic. Furthermore, this work can assist support seekers hoping to enhance and prioritize specific types of social support and user engagement.
Full Text Available
Creation and Analysis of an International Corpus of Privacy Laws

Gupta, Sonu; Gopi, Geetika; Balaji, Harish; Poplavska, Ellen; O'Toole, Nora; Arora, Siddhant; Norton, Thomas; Sadeh, Norman; Wilson, Shomir (May 2024, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)

The landscape of privacy laws and regulations around the world is complex and ever-changing. National and super-national laws, agreements, decrees, and other government-issued rules form a patchwork that companies must follow to operate internationally. To examine the status and evolution of this patchwork, we introduce the Privacy Law Corpus, of 1,043 privacy laws, regulations, and guidelines, covering 183 jurisdictions. This corpus enables a large-scale quantitative and qualitative examination of legal focus on privacy. We examine the temporal distribution of when privacy laws were created and illustrate the dramatic increase in privacy legislation over the past 50 years, although a finer-grained examination reveals that the rate of increase varies depending on the personal data types that privacy laws address. Our exploration also demonstrates that most privacy laws respectively address relatively few personal data types. Additionally, topic modeling results show the prevalence of common themes in privacy laws, such as finance, healthcare, and telecommunications. Finally, we release the corpus to the research community to promote further study.
Full Text Available
Automated Detection and Analysis of Data Practices Using A Real-World Corpus

https://doi.org/10.18653/v1/2024.findings-acl.271

Srinath, Mukund; Narayanan Venkit, Pranav; Badillo, Maria; Schaub, Florian; Giles, C; Wilson, Shomir (January 2024, Association for Computational Linguistics)

Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.
Full Text Available
Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability

https://doi.org/10.1145/3573128.3604902

Srinath, Mukund; Sundareswara, Soundarya; Venkit, Pranav; Giles, C. Lee; Wilson, Shomir (August 2023, DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023)

Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web and find that privacy policy URLs are available on only 34% of websites. Further, 1.37% of these URLs are broken links and 1.23% of the valid links lead to pages without a policy. Further, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English language privacy policies on the web and the distribution of these documents across top level domains and sectors of commerce. We estimate the lower bound on the number of English language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata consisting of policy URL, length, readability, sector of commerce, and policy crawl date.
Full Text Available
